
An Improved Analysis of Training Over-parameterized Deep Neural Networks

Zou, Difan, Gu, Quanquan

Neural Information Processing Systems

A recent line of research has shown that gradient-based algorithms with random initialization can converge to the global minima of the training loss for over-parameterized (i.e., sufficiently wide) deep neural networks. However, the condition on the width of the neural network to ensure the global convergence is very stringent, which is often a high-degree polynomial in the training sample size $n$ (e.g., $O(n^{24})$). In this paper, we provide an improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks, which only requires a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters. The main technical contributions of our analysis include (a) a tighter gradient lower bound that leads to a faster convergence of the algorithm, and (b) a sharper characterization of the trajectory length of the algorithm. By specializing our result to two-layer (i.e., one-hidden-layer) neural networks, it also provides a milder over-parameterization condition than the best-known result in prior work.
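To make the setting concrete, below is a minimal, illustrative sketch (not the paper's construction or bounds): full-batch gradient descent on a two-layer ReLU network whose width m greatly exceeds the sample size n, which is the over-parameterized regime the convergence analysis concerns. The data, width, step size, and fixed output layer are arbitrary choices made here for illustration only.

```python
# Illustrative sketch of gradient descent on an over-parameterized two-layer
# ReLU network (width m >> sample size n). All constants are assumptions.
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 20, 5, 2000                       # n samples, input dim d, width m >> n
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs, common in such analyses
y = rng.standard_normal(n)

# Random Gaussian initialization of the hidden layer; the output layer uses
# fixed +/-1 weights and only W is trained -- a common simplification.
W = rng.standard_normal((m, d)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=m)

eta = 0.5
for t in range(200):
    H = np.maximum(X @ W.T, 0.0)            # ReLU features, shape (n, m)
    pred = H @ a / np.sqrt(m)               # 1/sqrt(m) output scaling
    resid = pred - y
    loss = 0.5 * np.mean(resid ** 2)        # squared loss
    # Almost-everywhere gradient of the loss w.r.t. W (ReLU is piecewise linear).
    G = ((H > 0) * (resid[:, None] * a[None, :] / np.sqrt(m))).T @ X / n
    W -= eta * G
    if t % 50 == 0:
        print(f"step {t:3d}  loss {loss:.6f}")
```

In this wide regime the training loss decreases steadily from random initialization; the paper's contribution is a sharper analysis of how wide the network must be for such convergence to be guaranteed, not this particular training loop.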


Reviews: An Improved Analysis of Training Over-parameterized Deep Neural Networks

Neural Information Processing Systems

While this paper makes a nice contribution to an important problem, I am not sure if it is significant enough for the conference. The overall outline of the analysis follows closely that of [2], and the main new component is the improved gradient lower bound, which largely builds on the bounds in [2] and [16]. Although the improved analysis provides new insight and I find it useful, I do not feel that it will have a big impact. The other technical contribution on the improved trajectory length is also nice, but again I feel that it is somewhat incremental. The results seem technically sound; the proofs all look reasonable, although I did not verify them thoroughly.


Reviews: An Improved Analysis of Training Over-parameterized Deep Neural Networks

Neural Information Processing Systems

This paper analyzes the convergence of GD and SGD for over-parameterized networks, which results in an improved over-parameterization requirement. Initially the paper received weakly positive reviews. Specifically, the reviewers felt that while the contribution closely follows prior work [2, 16], the paper still makes a nice contribution that is insightful and novel. The rebuttal addressed the issues raised by the reviewers. While one reviewer remained concerned as to whether the ideas in the paper can be extended further, upon discussion, the reviewers agreed that the paper should be accepted.

